ROCm and HIP: A Detailed Story in 10 Chapters: The Memory-Centric Nature of GPU Compute Performance

To accelerate work on a GPU, we must firmly abandon the 'compute-first' mindset. Modern performance is determined by memory management: coordinating the allocation, synchronization, and optimization of data movement between the host (CPU) and the device (GPU).

1. The Gap Between Memory and Compute

Although GPU arithmetic throughput ($TFLOPS$) has risen rapidly, memory bandwidth ($GB/s$) has grown at a much slower rate. This creates a gap in which the compute units are often 'starved', waiting for data to arrive from graphics memory (VRAM). As a result, GPU programming is usually memory-centric programming.
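One way to put a number on this gap is the ratio of peak arithmetic throughput to memory bandwidth, often called machine balance. The sketch below is only illustrative: the function name `machine_balance` and the device figures (40 TFLOPS, 1600 GB/s) are assumptions chosen for easy arithmetic, not the specification of any real GPU.

```python
def machine_balance(peak_tflops: float, bandwidth_gbs: float) -> float:
    """Return how many FLOPs the device could execute in the time
    it takes to fetch one byte from memory (FLOPs per byte)."""
    peak_flops_per_s = peak_tflops * 1e12   # TFLOPS -> FLOP/s
    bytes_per_s = bandwidth_gbs * 1e9       # GB/s   -> byte/s
    return peak_flops_per_s / bytes_per_s


# Hypothetical device figures, assumed for illustration only.
PEAK_TFLOPS = 40.0
BANDWIDTH_GBS = 1600.0

balance = machine_balance(PEAK_TFLOPS, BANDWIDTH_GBS)
print(f"machine balance: {balance:.1f} FLOPs per byte")
```

Under these assumed figures, a kernel would need to perform about 25 floating-point operations for every byte it reads or writes before the compute units stop waiting on memory.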

2. The Roofline Model

This model relates arithmetic intensity (FLOPs/Byte, on the horizontal axis) to attainable performance (on the vertical axis). Most applications fall into one of two categories:

Memory-bound (the steep, sloped line): performance is limited by memory bandwidth.
Compute-bound (the horizontal plateau): performance is limited by peak TFLOPS.
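The two regimes can be sketched with the usual roofline formula, attainable performance = min(peak, arithmetic intensity × bandwidth). The helper names and the device figures below (40,000 GFLOPS peak, 1600 GB/s) are assumptions made for illustration, and the vector-add intensity assumes float32 data with two loads and one store per addition.

```python
def attainable_gflops(intensity: float, peak_gflops: float,
                      bandwidth_gbs: float) -> float:
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, intensity * bandwidth_gbs)


def classify(intensity: float, peak_gflops: float,
             bandwidth_gbs: float) -> str:
    """Label a kernel by which side of the ridge point it falls on."""
    ridge = peak_gflops / bandwidth_gbs  # FLOPs/byte where the roofs meet
    return "memory-bound" if intensity < ridge else "compute-bound"


# Hypothetical device figures, assumed for illustration only.
PEAK_GFLOPS = 40_000.0
BANDWIDTH_GBS = 1_600.0

# float32 vector add c[i] = a[i] + b[i]:
# 1 FLOP per element, 12 bytes moved (two 4-byte loads, one 4-byte store).
vector_add_intensity = 1.0 / 12.0

# A hypothetical kernel that reuses each byte heavily.
dense_intensity = 50.0

for name, ai in [("vector add", vector_add_intensity),
                 ("dense kernel", dense_intensity)]:
    bound = attainable_gflops(ai, PEAK_GFLOPS, BANDWIDTH_GBS)
    print(f"{name}: {classify(ai, PEAK_GFLOPS, BANDWIDTH_GBS)}, "
          f"at most {bound:.1f} GFLOPS")
```

With these assumed numbers the vector add sits far to the left of the ridge point, so no amount of extra compute capability would speed it up; only moving fewer bytes or moving them faster would.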

3. The Data Movement Tax

Most performance bottlenecks come not from computation but from the latency and energy cost of moving each byte across the PCIe bus or out of HBM. High-performance code prioritizes data residency and minimizes transfers between host and device.
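A rough cost model makes the tax concrete. The sketch below compares copying a buffer to the device and back around every kernel launch (for example with one host-device copy such as `hipMemcpy` in each direction) against copying it once and keeping it resident. The function names, the 25 GB/s effective link bandwidth, and the 10 microsecond per-transfer latency are all assumptions for illustration, not measurements.

```python
def transfer_seconds(num_bytes: int, bandwidth_gbs: float,
                     latency_s: float) -> float:
    """Simple model of one transfer: fixed latency plus bytes / bandwidth."""
    return latency_s + num_bytes / (bandwidth_gbs * 1e9)


def pipeline_transfer_seconds(num_bytes: int, launches: int, resident: bool,
                              bandwidth_gbs: float = 25.0,
                              latency_s: float = 10e-6) -> float:
    """Total host-device transfer time for a sequence of kernel launches.

    resident=False: copy in and out around every launch.
    resident=True:  copy in once, run all launches, copy out once.
    """
    one_way = transfer_seconds(num_bytes, bandwidth_gbs, latency_s)
    round_trips = 1 if resident else launches
    return 2 * one_way * round_trips


# Hypothetical workload, assumed for illustration only.
BUFFER_BYTES = 1_000_000_000   # 1 GB working set
LAUNCHES = 100                 # kernels applied to the same buffer

naive = pipeline_transfer_seconds(BUFFER_BYTES, LAUNCHES, resident=False)
resident = pipeline_transfer_seconds(BUFFER_BYTES, LAUNCHES, resident=True)

print(f"copy around every launch: {naive:.2f} s")
print(f"keep data resident:       {resident:.2f} s")
print(f"transfer time saved:      {naive - resident:.2f} s")
```

In this model the transfer cost grows linearly with the number of launches when data is not kept on the device, which is why redundant round trips tend to dominate the runtime of otherwise cheap kernels.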

QUESTION 1

What is the primary cause of a GPU kernel being 'memory-bound'?

The clock speed of the GPU cores is too slow.

The rate of data delivery is slower than the rate of arithmetic execution.

There are too many threads running in parallel.

The CPU is faster than the GPU.

QUESTION 2

In the context of GPU programming, what does 'Memory Management' involve?

Only allocating variables on the CPU stack.

Controlling allocation, synchronization, and optimization of data transfer between host and device.

Optimizing the cache size of the L1 controller.

Manually cleaning the GPU registers after every kernel call.

QUESTION 3

Which axis of the Roofline Model represents 'Arithmetic Intensity'?

Vertical Axis (Y)

Horizontal Axis (X)

The slope of the line.

The area under the curve.

QUESTION 4

Why is redundant host-device transfer considered a 'performance tax'?

It consumes GPU registers.

The latency and energy cost of moving data across PCIe are significantly higher than those of instruction execution.

It increases the floating-point precision error.

It causes the GPU to overheat instantly.

QUESTION 5

If a researcher's kernel spends 95% of its time 'stalled,' what is the most likely culprit?

The math instructions are too complex.

Inefficient orchestration of data residency, causing the GPU to wait for data.

The GPU has too much VRAM.

The kernel was written in C++ instead of Python.